Abstract

We consider the problem of adaptive stratified sampling for Monte Carlo integration
of a differentiable function given a finite number of evaluations of the function.
We construct a sampling scheme that samples more often in regions where
the function oscillates more, while allocating the samples so that they are well
spread over the domain (a notion similar to low discrepancy). We
prove that the estimate returned by the algorithm is almost as accurate as
the estimate that an optimal oracle strategy (which would know the variations of the
function everywhere) would return, and we provide a finite-sample analysis.



1 Introduction 

In this paper we consider the problem of numerical integration of a differentiable function
$f : [0,1]^d \to \mathbb{R}$ given a finite budget $n$ of evaluations of the function that can be
allocated sequentially.

A usual technique for reducing the mean squared error (w.r.t. the integral of $f$) of a Monte-Carlo
estimate is the so-called stratified Monte Carlo sampling, which considers sampling into a set of strata,
or regions of the domain, that form a partition, i.e. a stratification, of the domain (see [9, Subsection
5.5] or [5]). It is efficient (up to rounding issues) to stratify the domain, since when allocating to
each stratum a number of samples proportional to its measure, the mean squared error of the resulting
estimate is always smaller than or equal to that of the crude Monte-Carlo estimate (which samples
the domain uniformly).

Since the considered functions are differentiable, if the domain is stratified in $K$ hyper-cubic strata of
same measure and if one assigns uniformly at random $n/K$ samples per stratum, the mean squared
error of the resulting stratified estimate is in $O(n^{-1} K^{-2/d})$. We deduce that if the stratification
is built independently of the samples (before collecting the samples), and if $n$ is known from the
beginning (which is assumed here), the minimax-optimal choice for the stratification is to build $n$
strata of same measure and minimal diameter, and to assign only one sample per stratum uniformly
at random. We refer to this sampling technique as Uniform stratified Monte-Carlo. The resulting
estimate has a mean squared error of order $O(n^{-(1+2/d)})$. The arguments that advocate for stratifying
in strata of same measure and minimal diameter are closely linked to the reasons why quasi
Monte-Carlo methods, or low discrepancy sampling schemes, are efficient techniques for integrating
smooth functions. See [5] for a survey on these techniques.
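
The gap between the two rates can be checked numerically. The following is a minimal sketch in dimension $d = 1$, assuming a hypothetical test function `f_example` (not taken from the paper) whose integral over $[0,1]$ is known; the helper names are illustrative only.

```python
import math
import random


def crude_mc(f, n, rng):
    """Crude Monte-Carlo: average of f at n uniform points of [0, 1]."""
    return sum(f(rng.random()) for _ in range(n)) / n


def uniform_stratified_mc(f, n, rng):
    """Uniform stratified Monte-Carlo for d = 1: n intervals of length 1/n,
    one uniform point per interval."""
    return sum(f((k + rng.random()) / n) for k in range(n)) / n


def empirical_mse(estimator, f, n, truth, runs, seed):
    """Average squared error of an estimator over independent runs."""
    rng = random.Random(seed)
    return sum((estimator(f, n, rng) - truth) ** 2 for _ in range(runs)) / runs


def f_example(x):
    # Hypothetical smooth test function (an assumption, not from the paper).
    return math.sin(2.0 * math.pi * x) + x


TRUTH = 0.5  # exact integral of f_example over [0, 1]
mse_crude = empirical_mse(crude_mc, f_example, 64, TRUTH, 2000, seed=0)
mse_strat = empirical_mse(uniform_stratified_mc, f_example, 64, TRUTH, 2000, seed=0)
```

With $n = 64$ the stratified error is expected to be smaller by several orders of magnitude, in line with the $n^{-1}$ versus $n^{-3}$ rates for $d = 1$.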

It is minimax-optimal to stratify the domain in $n$ strata and sample one point per stratum, but it
would also be interesting to adapt the stratification of the space with respect to the function $f$. For
example, if the function has larger variations in a region of the domain, we would like to discretize
the domain in smaller strata in this region, so that more samples are assigned to this region. Since
$f$ is initially unknown, it is not possible to design a good stratification before sampling. However,
an efficient algorithm should allocate the samples in order to estimate online the variations of the
function in each region of the domain while, at the same time, allocating more samples in regions
where $f$ has larger local variations.

The papers [4, 6, 2] provide algorithms for solving a similar trade-off when the stratification is fixed:
these algorithms allocate more samples to strata in which the function has larger variations. It is,
however, clear that the larger the number of strata, the more difficult it is to allocate the samples
almost optimally in the strata.

Contributions: We propose a new algorithm, Lipschitz Monte-Carlo Upper Confidence Bound
(LMC-UCB), for tackling this problem. It is a two-layered algorithm. It first stratifies the domain
in $K \ll n$ strata, and then allocates uniformly to each stratum an initial small amount of samples
in order to estimate roughly the variations of the function per stratum. Then our algorithm
sub-stratifies each of the $K$ strata according to the estimated local variations, so that there are in total
approximately $n$ sub-strata, and allocates one point per sub-stratum. In that way, our algorithm
discretizes the domain into more refined strata in regions where the function has higher variations.
It combines the advantages of quasi Monte-Carlo and adaptive strategies.

More precisely, our contributions are the following: 

• We prove an asymptotic lower bound on the mean squared error of the estimate returned by
an optimal oracle strategy that has access to the variations of the function $f$ everywhere and
would use the best stratification of the domain with hyper-cubes (possibly of heterogeneous
sizes). This quantity, since it is a lower bound for any oracle strategy, is smaller than
the mean squared error of the estimate provided by Uniform stratified Monte-Carlo (which
is the non-adaptive minimax-optimal strategy on the class of differentiable functions), and
also smaller than that of crude Monte-Carlo.

• We introduce the algorithm LMC-UCB, that sub-stratifies the $K$ strata in hyper-cubic
sub-strata, and samples one point per sub-stratum. The number of sub-strata per stratum is
linked to the variations of the function in the stratum. We prove that algorithm LMC-UCB
is asymptotically as efficient as the optimal oracle strategy. We also provide finite-time
results when $f$ admits a Taylor expansion of order 2 in every point. By tuning the number
of strata $K$ wisely, it is possible to build an algorithm that is almost as efficient as the
optimal oracle strategy.

The paper is organized as follows. Section 2 defines the notations used throughout the paper.
Section 3 states the asymptotic lower bound on the mean squared error of the optimal oracle strategy.
In this Section, we also provide an intuition on how the number of samples into each stratum should
be linked to the variation of the function in the stratum in order for the mean squared error of the
estimate to be small. Section 4 presents the algorithm LMC-UCB and the first Lemma on how many
sub-strata are built in the initial strata. Section 5 finally states that the algorithm LMC-UCB is
almost as efficient as the optimal oracle strategy. We finally conclude the paper. Due to the lack of
space, we provide experiments and proofs in the Supplementary Material.

2 Setting 

We consider a function $f : [0,1]^d \to \mathbb{R}$. We want to estimate as accurately as possible its integral
according to the Lebesgue measure, i.e. $\int_{[0,1]^d} f(x)\,dx$. In order to do that, we consider algorithms
that stratify the domain in two layers of strata, one more refined than the other. The strata of the
refined layer are referred to as sub-strata, and we sample in the sub-strata. We will compare the
performances of the algorithms we construct with the performances of the optimal oracle algorithm
that has access to the variations $\|\nabla f(x)\|_2$ of the function $f$ everywhere in the domain, and is
allowed to sample the domain where it wishes.

The first step is to partition the domain $[0,1]^d$ in $K$ measurable strata. In this paper, we assume
that $K^{1/d}$ is an integer.¹ This enables us to partition, in a natural way, the domain in $K$ hyper-cubic
strata $(\Omega_k)_{k \le K}$ of same measure $w_k = \frac{1}{K}$. Each of these strata is a region of the domain $[0,1]^d$,
and the $K$ strata form a partition of the domain. We write $\mu_k = \frac{1}{w_k} \int_{\Omega_k} f(x)\,dx$ the mean and
$\sigma_k^2 = \frac{1}{w_k} \int_{\Omega_k} (f(x) - \mu_k)^2 dx$ the variance of a sample of the function $f$ when sampling $f$ at a point
chosen at random according to the Lebesgue measure conditioned to stratum $\Omega_k$.

¹ This is not restrictive in small dimension, but it may become more constraining for large $d$.

We possess a budget of $n$ samples (which is assumed to be known in advance), which means that
we can sample $n$ times the function at any point of $[0,1]^d$. We denote by $\mathcal{A}$ an algorithm that
sequentially allocates the budget by sampling at round $t$ in the stratum indexed by $k_t \in \{1, \ldots, K\}$,
and returns after all $n$ samples have been used an estimate $\hat\mu_n$ of the integral of the function $f$.

We consider strategies that sub-partition each stratum in hyper-cubes of same measure in $\Omega_k$, but
of heterogeneous measure among the $\Omega_k$. In this way, the number of sub-strata in each stratum $\Omega_k$
can adapt to the variations of $f$ within $\Omega_k$. The algorithms that we consider return a sub-partition of
each stratum $\Omega_k$ in $S_k$ sub-strata. We call $\mathcal{N}_k = (\Omega_{k,i})_{i \le S_k}$ the sub-partition of stratum $\Omega_k$. In each
of these sub-strata, the algorithm allocates at least one point.² We write $X_{k,i}$ the first point sampled
uniformly at random in sub-stratum $\Omega_{k,i}$. We write $w_{k,i}$ the measure of the sub-stratum $\Omega_{k,i}$. Let us
write $\mu_{k,i} = \frac{1}{w_{k,i}} \int_{\Omega_{k,i}} f(x)\,dx$ the mean and $\sigma_{k,i}^2 = \frac{1}{w_{k,i}} \int_{\Omega_{k,i}} (f(x) - \mu_{k,i})^2 dx$ the variance of a
sample of $f$ in sub-stratum $\Omega_{k,i}$ (i.e. of $X_{k,i} = f(U_{k,i})$ where $U_{k,i} \sim \mathcal{U}_{\Omega_{k,i}}$).

This class of 2-layered sampling strategies is rather large. In fact it contains strategies that are
similar to low discrepancy strategies, and also to any stratified Monte-Carlo strategy. For example,
consider that all $K$ strata are hyper-cubes of same measure $\frac{1}{K}$ and that each stratum $\Omega_k$ is partitioned
into $S_k$ hyper-rectangles $\Omega_{k,i}$ of minimal diameter and same measure $\frac{1}{K S_k}$. If the algorithm allocates
one point per sub-stratum, its sampling scheme shares similarities with quasi Monte-Carlo sampling
schemes, since the points at which the function is sampled are well spread.

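Let us now consider an algorithm that first chooses the sub-partition $(\mathcal{N}_k)_k$ and then allocates
deterministically 1 sample uniformly at random in each sub-stratum $\Omega_{k,i}$. We consider the stratified
estimate $\hat\mu_n = \sum_{k=1}^{K} \sum_{i=1}^{S_k} \frac{w_k}{S_k} X_{k,i}$ of $\mu$. We have

$$\mathbb{E}(\hat\mu_n) = \sum_{k \le K} \sum_{i=1}^{S_k} \frac{w_k}{S_k} \mu_{k,i} = \sum_{k \le K} \sum_{i=1}^{S_k} \int_{\Omega_{k,i}} f(x)\,dx = \int_{[0,1]^d} f(x)\,dx = \mu,$$

and also

$$\mathbb{V}(\hat\mu_n) = \sum_{k \le K} \sum_{i=1}^{S_k} \frac{w_k^2}{S_k^2} \mathbb{E}(X_{k,i} - \mu_{k,i})^2 = \sum_{k \le K} \sum_{i=1}^{S_k} \frac{w_k^2}{S_k^2} \sigma_{k,i}^2.$$

For a given algorithm $\mathcal{A}$ that builds for each stratum $k$ a sub-partition $\mathcal{N}_k = (\Omega_{k,i})_{i \le S_k}$, we call
pseudo-risk the quantity

$$L_n(\mathcal{A}) = \sum_{k \le K} \sum_{i=1}^{S_k} \frac{w_k^2}{S_k^2} \sigma_{k,i}^2.$$

Some further insight on this quantity is provided in the paper [3].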
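
To make the pseudo-risk concrete, here is a minimal one-dimensional sketch that evaluates $\sum_k \sum_i (w_k/S_k)^2 \sigma_{k,i}^2$ for a given choice of sub-strata counts. It assumes $d = 1$, strata of equal length, and a midpoint-rule approximation of the sub-stratum variances; the function name `pseudo_risk_1d` and the `grid` parameter are illustrative choices, not part of the paper.

```python
def pseudo_risk_1d(f, sub_strata_counts, grid=200):
    """Approximate pseudo-risk on [0, 1] for K = len(sub_strata_counts) strata
    of equal length w_k = 1/K, stratum k being cut into S_k equal cells."""
    K = len(sub_strata_counts)
    w = 1.0 / K
    risk = 0.0
    for k, S in enumerate(sub_strata_counts):
        h = w / S  # length of each sub-stratum of stratum k
        for i in range(S):
            left = k * w + i * h
            # Midpoint-rule approximation of the mean and variance of f(U),
            # with U uniform on the sub-stratum [left, left + h].
            values = [f(left + (j + 0.5) * h / grid) for j in range(grid)]
            mean = sum(values) / grid
            var = sum((v - mean) ** 2 for v in values) / grid
            risk += (w / S) ** 2 * var
    return risk


# For the affine function f(x) = 3x with K = 1 and S_1 = 4, the closed form
# is a^2 / (12 * S^3) = 9 / 768.
risk_affine = pseudo_risk_1d(lambda x: 3.0 * x, [4])
# For f(x) = x^2, which is steeper on the right half, moving one sub-stratum
# from the left stratum to the right one lowers the pseudo-risk.
risk_even = pseudo_risk_1d(lambda x: x * x, [4, 4])
risk_skewed = pseudo_risk_1d(lambda x: x * x, [3, 5])
```

The last two lines illustrate, under these assumptions, why it can pay to refine the stratification where the function varies more.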

Consider now the uniform strategy, i.e. a strategy that divides the domain in $K = n$ hyper-cubic
strata. This strategy is a fairly natural, minimax-optimal static strategy, on the class of differentiable
functions defined on $[0,1]^d$, when no information on $f$ is available. We will prove in the next Section
that its asymptotic mean squared error is equal to

$$\frac{1}{12} \Big( \int_{[0,1]^d} \|\nabla f(x)\|_2^2 \, dx \Big) \frac{1}{n^{1+\frac{2}{d}}}.$$

This quantity is of order $n^{-1-2/d}$, which is smaller, as expected, than $1/n$: this strategy is more
efficient than crude Monte-Carlo.

We will also prove in the next Section that the minimum asymptotic mean squared error of an
optimal oracle strategy (we call it "oracle" because it builds the stratification using the information
about the variations $\|\nabla f(x)\|_2$ of $f$ in every point $x$), is larger than

$$\frac{1}{12} \Big( \int_{[0,1]^d} (\|\nabla f(x)\|_2)^{\frac{d}{d+1}} dx \Big)^{2\frac{(d+1)}{d}} \frac{1}{n^{1+\frac{2}{d}}}.$$

This quantity is always smaller than the asymptotic mean squared error of the Uniform stratified
Monte-Carlo strategy, which makes sense since this strategy assumes the knowledge of the variations
of $f$ everywhere, and can thus adapt accordingly the number of samples in each region. We define

$$\Sigma = \frac{1}{12} \Big( \int_{[0,1]^d} (\|\nabla f(x)\|_2)^{\frac{d}{d+1}} dx \Big)^{2\frac{(d+1)}{d}}. \quad (2)$$

² This implies that $\sum_k S_k \le n$.

Given this minimum asymptotic mean squared error of an optimal oracle strategy, we define the
pseudo-regret of an algorithm $\mathcal{A}$ as

$$R_n(\mathcal{A}) = L_n(\mathcal{A}) - \Sigma \frac{1}{n^{1+\frac{2}{d}}}. \quad (3)$$

This pseudo-regret is the difference between the pseudo-risk of the estimate provided by algorithm
$\mathcal{A}$, and the lower bound on the optimal oracle mean squared error. In other words, this pseudo-regret
is the price an adaptive strategy pays for not knowing in advance the function $f$, and thus not having
access to its variations. An efficient adaptive strategy should aim at minimizing this gap coming
from the lack of information.
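
The two asymptotic constants can be compared numerically. The sketch below assumes $d = 1$ (so $d/(d+1) = 1/2$ and $2(d+1)/d = 4$) and a hypothetical example $f(x) = x^3$, whose derivative $3x^2$ is passed in directly; the function name and the midpoint-rule integration are illustrative choices.

```python
import math


def oracle_and_uniform_constants_1d(grad_abs, grid=10_000):
    """Midpoint-rule approximations on [0, 1], for d = 1, of
    sigma   = (1/12) * (integral of |f'|^(1/2))^4   (oracle constant) and
    uniform = (1/12) *  integral of |f'|^2          (uniform stratified constant)."""
    xs = [(j + 0.5) / grid for j in range(grid)]
    int_half = sum(math.sqrt(grad_abs(x)) for x in xs) / grid
    int_sq = sum(grad_abs(x) ** 2 for x in xs) / grid
    return int_half ** 4 / 12.0, int_sq / 12.0


# Hypothetical example: f(x) = x^3, so |f'(x)| = 3 x^2.
sigma_const, uniform_const = oracle_and_uniform_constants_1d(lambda x: 3.0 * x * x)
```

For this example the exact values are $\Sigma = \frac{1}{12}\cdot\frac{9}{16}$ and $\frac{1}{12}\cdot\frac{9}{5}$, so the oracle constant is about three times smaller than the uniform one.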

3 Discussion on the optimal asymptotic mean squared error 

3.1 Asymptotic lower bound on the mean squared error, and comparison with the Uniform
stratified Monte-Carlo

A first part of the analysis of the exposed problem consists in finding a good point of comparison 
for the pseudo-risk. The following Lemma states an asymptotic lower bound on the mean squared 
error of the optimal oracle sampling strategy. 

Lemma 1 Assume that $f$ is such that $\nabla f$ is continuous and $\int \|\nabla f(x)\|_2^2 dx < \infty$. Let $((\Omega_k^n)_{k \le n})_n$
be an arbitrary sequence of partitions of $[0,1]^d$ in $n$ strata such that all the strata are hyper-cubes,
and such that the maximum diameter of each stratum goes to 0 as $n \to +\infty$ (but the strata are
allowed to have heterogeneous measures). Let $\hat\mu_n$ be the stratified estimate of the function for the
partition $(\Omega_k^n)_{k \le n}$ when there is one point pulled at random per stratum. Then

$$\liminf_{n \to \infty} n^{1+2/d} \, \mathbb{V}(\hat\mu_n) \ge \Sigma.$$

The full proof of this Lemma is in the Supplementary Material, Appendix B.

We have also the following equality for the asymptotic mean squared error of the uniform strategy.

Lemma 2 Assume that $f$ is such that $\nabla f$ is continuous and $\int \|\nabla f(x)\|_2^2 dx < \infty$. For any $n = l^d$
such that $l$ is an integer (and thus such that it is possible to partition the domain in $n$ hyper-cubic
strata of same measure), define $((\Omega_k^n)_{k \le n})_n$ as the sequence of partitions in hyper-cubic strata of
same measure $1/n$. Let $\hat\mu_n$ be the stratified estimate of the function for the partition $(\Omega_k^n)_{k \le n}$ when
there is one point pulled at random per stratum. Then

$$\liminf_{n \to \infty} n^{1+2/d} \, \mathbb{V}(\hat\mu_n) = \frac{1}{12} \Big( \int_{[0,1]^d} \|\nabla f(x)\|_2^2 \, dx \Big).$$

The proof of this Lemma is substantially similar to the proof of Lemma 1 in the Supplementary
Material, Appendix B. The only difference is that the measure of each stratum $\Omega_k^n$ is $1/n$ and that in
Step 2, instead of Fatou's Lemma, the Theorem of dominated convergence is required.

The optimal rate for the mean squared error, which is also the rate of the Uniform stratified
Monte-Carlo in Lemma 2, is $n^{-1-2/d}$ and is attained with ideas of low discrepancy sampling. The constant
can however be improved (with respect to the constant in Lemma 2), by adapting to the specific
shape of each function. In Lemma 1, we exhibit a lower bound for this constant (and without
surprise, $\frac{1}{12} \int_{[0,1]^d} \|\nabla f(x)\|_2^2 dx \ge \Sigma$). Our aim is to build an adaptive sampling scheme, also
sharing ideas with low discrepancy sampling, that attains this lower bound.

There is one main restriction in both Lemmas: we impose that the sequence of partitions $((\Omega_k^n)_{k \le n})_n$
is composed only of strata that have the shape of a hyper-cube. This assumption is in fact
reasonable: indeed, if the shape of the strata could be arbitrary, one could take the level sets (or
approximate level sets, as the number of strata is limited by $n$) as strata, and this would lead to
$\lim_{n \to \infty} \inf_{\Omega} n^{1+2/d} \, \mathbb{V}(\hat\mu_{n,\Omega}) = 0$. But this is not a fair competition, as the function is unknown,
and determining these level sets is actually a much harder problem than integrating the function.
The fact that the strata are hyper-cubes appears, in fact, in the bound. If we had chosen other shapes,
e.g. $\ell_2$ balls, the constant $\frac{1}{12}$ in front of the bounds in both Lemmas would change.³ It is however not
possible to make a finite partition in $\ell_2$ balls of $[0,1]^d$, and we chose hyper-cubes since it is quite
easy to stratify $[0,1]^d$ in hyper-cubic strata.

³ The $\frac{1}{12}$ comes from computing the variance of a uniform random variable on $[0,1]$.



The proof of Lemma 1 makes the quantity

$$s^*(x) = \frac{(\|\nabla f(x)\|_2)^{\frac{d}{d+1}}}{\int_{[0,1]^d} (\|\nabla f(u)\|_2)^{\frac{d}{d+1}} du}$$

appear. This quantity is
proposed as "asymptotic optimal allocation", i.e. the asymptotically optimal number of sub-strata
one would ideally create in any small sub-stratum centered in $x$. This is however not very useful for
building an algorithm. The next Subsection provides an intuition on this matter.

3.2 An intuition of a good allocation: Piecewise linear functions 

In this Subsection, we (i) provide an example where the asymptotic optimal mean squared error is 
also the optimal mean squared error at finite distance and (ii) provide explicitly what is, in that case, 
a good allocation. We do that in order to give an intuition for the algorithm that we introduce in the 
next Section. 

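We consider a partition in $K$ hyper-cubic strata $\Omega_k$. Let us assume that the function $f$ is affine on all
strata $\Omega_k$, i.e. on stratum $\Omega_k$, we have $f(x) = (\langle \theta_k, x \rangle + \rho_k)\,\mathbb{I}\{x \in \Omega_k\}$. In that case $\mu_k = f(a_k)$
where $a_k$ is the center of the stratum $\Omega_k$. We then have:

$$\sigma_k^2 = \frac{1}{w_k} \int_{\Omega_k} (f(x) - f(a_k))^2 dx = \frac{1}{w_k} \int_{\Omega_k} \langle \theta_k, x - a_k \rangle^2 dx = \frac{\|\theta_k\|_2^2}{12} w_k^{2/d}.$$

We consider also a sub-partition of $\Omega_k$ in $S_k$ hyper-cubes of same size (we assume that $S_k^{1/d}$ is
an integer), and we assume that in each sub-stratum $\Omega_{k,i}$, we sample one point. We also have
$\sigma_{k,i}^2 = \frac{\|\theta_k\|_2^2}{12} \big( \frac{w_k}{S_k} \big)^{2/d}$ for sub-stratum $\Omega_{k,i}$.

For a given $k$ and a given $S_k$, all the $\sigma_{k,i}$ are equal. The pseudo-risk of an algorithm $\mathcal{A}$ that divides
each stratum in $S_k$ sub-strata is thus

$$L_n(\mathcal{A}) = \sum_{k \le K} \sum_{i \le S_k} \frac{w_k^2}{S_k^2} \frac{\|\theta_k\|_2^2}{12} \Big( \frac{w_k}{S_k} \Big)^{2/d} = \sum_{k \le K} \frac{w_k^{2+2/d}}{S_k^{1+2/d}} \frac{\|\theta_k\|_2^2}{12} = \sum_{k \le K} \frac{w_k^2}{S_k^{1+2/d}} \sigma_k^2.$$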

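If a non-adaptive algorithm $\mathcal{A}^*$ has access to the variances $\sigma_k^2$ in the strata, it can choose to allocate
the budget in order to minimize the pseudo-risk. After solving the simple optimization problem
of minimizing $L_n(\mathcal{A})$ with respect to $(S_k)_k$, we deduce that an optimal oracle strategy on this
stratification would divide each stratum $k$ in

$$S_k^* = \frac{(w_k \sigma_k)^{\frac{d}{d+1}}}{\sum_{i \le K} (w_i \sigma_i)^{\frac{d}{d+1}}} \, n$$

sub-strata.⁴ The pseudo-risk for this strategy is then

$$L_{n,K}(\mathcal{A}^*) = \frac{\big( \sum_{k \le K} (w_k \sigma_k)^{\frac{d}{d+1}} \big)^{\frac{2(d+1)}{d}}}{n^{1+\frac{2}{d}}} = \frac{\Sigma_K^{\frac{2(d+1)}{d}}}{n^{1+\frac{2}{d}}},$$

where we write $\Sigma_K = \sum_{i \le K} (w_i \sigma_i)^{\frac{d}{d+1}}$. We will call in the paper optimal proportions the quantities

$$\lambda_{K,k} = \frac{(w_k \sigma_k)^{\frac{d}{d+1}}}{\sum_{i \le K} (w_i \sigma_i)^{\frac{d}{d+1}}}. \quad (5)$$

In the specific case of functions that are piecewise linear, we have
$\Sigma_K = \sum_{k \le K} (w_k \sigma_k)^{\frac{d}{d+1}} = \sum_{k \le K} w_k \big( \frac{\|\theta_k\|_2}{2\sqrt{3}} \big)^{\frac{d}{d+1}} = \int_{[0,1]^d} \frac{(\|\nabla f(x)\|_2)^{\frac{d}{d+1}}}{(2\sqrt{3})^{\frac{d}{d+1}}} dx$. We thus have

$$L_{n,K}(\mathcal{A}^*) = \frac{1}{12} \Big( \int_{[0,1]^d} (\|\nabla f(x)\|_2)^{\frac{d}{d+1}} dx \Big)^{\frac{2(d+1)}{d}} \frac{1}{n^{1+\frac{2}{d}}} = \Sigma \frac{1}{n^{1+\frac{2}{d}}}. \quad (6)$$

This optimal oracle strategy attains the lower bound in Lemma 1. We will thus construct, in the next
Section, an algorithm that learns and adapts to the optimal proportions defined in Equation 5.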
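
The oracle allocation for piecewise affine functions can be sketched in a few lines. The helper names below are hypothetical, the per-stratum standard deviations are made-up example values, and rounding of the counts is ignored, as in this Subsection.

```python
def optimal_proportions(weights, sigmas, d):
    """Optimal proportions lambda_{K,k} = (w_k sigma_k)^(d/(d+1)) / Sigma_K,
    returned together with Sigma_K."""
    p = d / (d + 1)
    terms = [(w * s) ** p for w, s in zip(weights, sigmas)]
    sigma_K = sum(terms)
    return [t / sigma_K for t in terms], sigma_K


def affine_pseudo_risk(weights, sigmas, counts, d):
    """Pseudo-risk sum_k w_k^2 sigma_k^2 / S_k^(1 + 2/d), valid when f is
    affine on each stratum and stratum k is cut into S_k hyper-cubes."""
    return sum(w * w * s * s / S ** (1.0 + 2.0 / d)
               for w, s, S in zip(weights, sigmas, counts))


# Hypothetical example: d = 1, two strata of measure 1/2, n = 90 samples.
w, s, n, d = [0.5, 0.5], [1.0, 4.0], 90, 1
lam, sigma_K = optimal_proportions(w, s, d)
oracle_risk = affine_pseudo_risk(w, s, [l * n for l in lam], d)
uniform_risk = affine_pseudo_risk(w, s, [n / 2, n / 2], d)
```

In this example the proportions are $1/3$ and $2/3$, the oracle pseudo-risk equals $\Sigma_K^{2(d+1)/d} / n^{1+2/d}$, and it is smaller than the pseudo-risk of the even split.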

⁴ We deliberately forget about rounding issues in this Subsection. The allocation we provide might not be
realizable (e.g. if $S_k^*$ is not an integer), but plugging it in the bound provides a lower bound on any realizable
performance.

4 The Algorithm LMC-UCB
4.1 Algorithm LMC-UCB 

We present the algorithm Lipschitz Monte Carlo Upper Confidence Bound (LMC-UCB). It takes
as parameter a partition $(\Omega_k)_{k \le K}$ in $K \le n$ hyper-cubic strata of same measure $1/K$ (it is possible
since we assume that $\exists\, l \in \mathbb{N}$ such that $l^d = K$). It also takes as parameter a uniform upper bound $L$ on
$\|\nabla f(x)\|_2$, and $\delta$, a (small) probability. The aim of algorithm LMC-UCB is to sub-stratify each
stratum $\Omega_k$ in $\lambda_{K,k} n = \frac{(w_k \sigma_k)^{\frac{d}{d+1}}}{\sum_{i=1}^{K} (w_i \sigma_i)^{\frac{d}{d+1}}} n$ hyper-cubic sub-strata of same measure and sample one
point per sub-stratum. An intuition on why this target is relevant was provided in Section 3.

Algorithm LMC-UCB starts by sub-stratifying each stratum $\Omega_k$ in
$\bar S = \Big( \Big\lfloor \big( (\frac{n}{K})^{\frac{d}{d+1}} \big)^{1/d} \Big\rfloor \Big)^{d}$ hyper-cubic
strata of same measure. It is possible to do that since by definition, $\bar S^{1/d}$ is an integer. We
write this first sub-stratification $\mathcal{N}'_k = (\Omega'_{k,i})_{i \le \bar S}$. It then pulls one sample per sub-stratum in $\mathcal{N}'_k$ for
each $\Omega_k$.

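It then sub-stratifies again each stratum $\Omega_k$ using the information collected. It sub-stratifies each
stratum $\Omega_k$ in

$$S_k = \max\left\{ \left\lfloor \left[ \frac{w_k^{\frac{d}{d+1}} \Big( \hat\sigma_{k,K\bar S} + A \big( \frac{w_k}{\bar S} \big)^{1/d} \frac{1}{\sqrt{\bar S}} \Big)^{\frac{d}{d+1}}}{\sum_{i=1}^{K} w_i^{\frac{d}{d+1}} \Big( \hat\sigma_{i,K\bar S} + A \big( \frac{w_i}{\bar S} \big)^{1/d} \frac{1}{\sqrt{\bar S}} \Big)^{\frac{d}{d+1}}} \, (n - K\bar S) \right]^{1/d} \right\rfloor^{d}, \; \bar S \right\} \quad (7)$$

hyper-cubic strata of same measure (see Figure 1 for a definition of $A$). It is possible to do that
because by definition, $S_k^{1/d}$ is an integer. We call this sub-stratification of stratum $\Omega_k$ stratification
$\mathcal{N}_k = (\Omega_{k,i})_{i \le S_k}$. In the last Equation, we compute the empirical standard deviation in stratum $\Omega_k$
at time $K\bar S$, from the samples $X'_{k,i}$ pulled in the sub-strata $\Omega'_{k,i}$ of $\mathcal{N}'_k$, as

$$\hat\sigma_{k,K\bar S} = \sqrt{ \frac{1}{\bar S - 1} \sum_{i=1}^{\bar S} \Big( X'_{k,i} - \frac{1}{\bar S} \sum_{j=1}^{\bar S} X'_{k,j} \Big)^2 }. \quad (8)$$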

Algorithm LMC-UCB then samples in each sub-stratum $\Omega_{k,i}$ one point. It is possible to do that
since, by definition of $S_k$, $\sum_k S_k + K\bar S \le n$.

The algorithm outputs an estimate $\hat\mu_n$ of the integral of $f$, computed with the first point in each
sub-stratum of partition $\mathcal{N}_k$. We present in Figure 1 the pseudo-code of algorithm LMC-UCB.



Input: Partition $(\Omega_k)_{k \le K}$, $L$, $\delta$, set $A = 2L\sqrt{d}\sqrt{\log(2K/\delta)}$
Initialize: $\forall k \le K$, sample 1 point in each stratum of partition $\mathcal{N}'_k$
Main algorithm:

    Compute $S_k$ for each $k \le K$
    Create partition $\mathcal{N}_k$ for each $k \le K$
    Sample a point in $\Omega_{k,i} \in \mathcal{N}_k$ for $i \le S_k$

Output: Return the estimate $\hat\mu_n$ computed when taking the first point $X_{k,i}$ in each sub-stratum $\Omega_{k,i}$ of
$\mathcal{N}_k$, that is to say $\hat\mu_n = \sum_{k=1}^{K} w_k \sum_{i=1}^{S_k} \frac{X_{k,i}}{S_k}$.

Figure 1: Pseudo-code of LMC-UCB. The definitions of $\mathcal{N}'_k$, $\bar S$, $\mathcal{N}_k$, $\Omega_{k,i}$ and $S_k$ are in the main text.
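
The two-stage structure of the pseudo-code can be sketched as follows for $d = 1$. This is a simplified illustration under stated assumptions, not the authors' implementation: the confidence width `A * (w / S_bar) / sqrt(S_bar)` is an assumed form of the upper confidence bound, the second-stage counts use the rounding `1 + floor(...)` (chosen here so that the total number of evaluations never exceeds $n$) instead of the rule of Equation 7, and all second-stage points are drawn afresh.

```python
import math
import random


def lmc_ucb_1d(f, n, K, L, delta, rng):
    """Hedged one-dimensional sketch of a two-stage adaptive stratified
    estimate on [0, 1]. Returns (estimate, sub-strata counts, evaluations used)."""
    if n < 4 * K:
        raise ValueError("this sketch assumes n >= 4K")
    d = 1
    w = 1.0 / K
    S_bar = math.isqrt(n // K)  # floor((n/K)^(d/(d+1))) when d = 1
    A = 2.0 * L * math.sqrt(d) * math.sqrt(math.log(2.0 * K / delta))

    # Stage 1: one uniform point in each of the S_bar cells of every stratum,
    # then an upper confidence bound on the standard deviation of the stratum.
    ucb = []
    for k in range(K):
        h = w / S_bar
        xs = [f(k * w + (i + rng.random()) * h) for i in range(S_bar)]
        mean = sum(xs) / S_bar
        sigma_hat = math.sqrt(sum((x - mean) ** 2 for x in xs) / (S_bar - 1))
        ucb.append(sigma_hat + A * (w / S_bar) / math.sqrt(S_bar))  # assumed width

    # Stage 2: split the remaining budget following the estimated proportions.
    p = d / (d + 1)
    scores = [(w * u) ** p for u in ucb]
    remaining = n - K * S_bar
    counts = [1 + int((remaining - K) * s / sum(scores)) for s in scores]

    # One uniform point per sub-stratum, combined into the stratified estimate.
    estimate = 0.0
    for k, S in enumerate(counts):
        h = w / S
        estimate += w * sum(f(k * w + (i + rng.random()) * h) for i in range(S)) / S
    return estimate, counts, K * S_bar + sum(counts)
```

For instance, calling `lmc_ucb_1d(lambda x: x ** 3, 400, 4, 3.0, 0.1, random.Random(0))` should place more sub-strata in the last stratum, where the derivative of $x^3$ is largest, than in the first one.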

4.2 High probability lower bound on the number of sub-strata of stratum $\Omega_k$

We first state an assumption on the function $f$.

Assumption 1 The function $f$ is such that $\nabla f$ exists and $\forall x \in [0,1]^d$, $\|\nabla f(x)\|_2 \le L$.

The next Lemma states that with high probability, the number $S_k$ of sub-strata of stratum $\Omega_k$, in
which there is at least one point, adjusts "almost" to the unknown optimal proportions.

Lemma 3 Let Assumption 1 be satisfied and $(\Omega_k)_{k \le K}$ be a partition in $K$ hyper-cubic strata of
same measure. If $n \ge 4K$, then with probability at least $1 - \delta$, $\forall k$, the number of sub-strata satisfies

$$S_k \ge \max\Big[ \lambda_{K,k} \Big( n - 7(L+1) d^{3/2} \sqrt{\log(K/\delta)} \Big( 1 + \frac{1}{\Sigma_K} \Big) K^{\frac{1}{d+1}} n^{\frac{d}{d+1}} \Big), \; \bar S \Big].$$

The proof of this result is in the Supplementary Material (Appendix C).

4.3 Remarks

A sampling scheme that shares ideas with quasi Monte-Carlo methods: Algorithm LMC-UCB
almost manages to divide each stratum $\Omega_k$ in $\lambda_{K,k} n$ hyper-cubic strata of same measure, each
one of them containing at least one sample. It is thus possible to build a learning procedure that, at
the same time, estimates the empirical proportions $\lambda_{K,k}$, and allocates the samples proportionally to
them.

The error terms: There are two reasons why we are not able to divide exactly each stratum $\Omega_k$
in $\lambda_{K,k} n$ hyper-cubic strata of same measure. The first reason is that the true proportions $\lambda_{K,k}$ are
unknown, and that it is thus necessary to estimate them. The second reason is that we want to build
strata that are hyper-cubes of same measure. The number of strata $S_k$ thus needs to be such that
$S_k^{1/d}$ is an integer. We thus also lose efficiency because of rounding issues.
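
The rounding constraint can be illustrated with a small helper that returns the largest admissible number of sub-strata not exceeding a target. The function name is hypothetical; the two correction loops guard against floating-point error in the $d$-th root.

```python
def largest_dth_power_at_most(x, d):
    """Largest integer of the form r**d (r a non-negative integer) that is <= x,
    i.e. the largest sub-strata count whose d-th root is an integer."""
    r = int(round(x ** (1.0 / d)))
    while r ** d > x:          # correct an over-estimate caused by rounding
        r -= 1
    while (r + 1) ** d <= x:   # correct an under-estimate caused by rounding
        r += 1
    return r ** d
```

For example, with $d = 3$ a target of 26.9 sub-strata is rounded down to $2^3 = 8$, which shows how costly this constraint can become as the dimension grows.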

5 Main results 

5.1 Asymptotic convergence of algorithm LMC-UCB 

By just combining the result of Lemma 1 with the result of Lemma 3, it is possible to show that
algorithm LMC-UCB is asymptotically (when $K$ goes to $+\infty$ and $n \ge K$) as efficient as the optimal
oracle strategy of Lemma 1.

Theorem 1 Assume that $\nabla f$ is continuous, and that Assumption 1 is satisfied. Let $(\Omega_k^n)_{n, k \le K_n}$ be
an arbitrary sequence of partitions such that all the strata are hyper-cubes, such that $4K_n \le n$, such
that the diameter of each stratum goes to 0, and such that $\lim_{n \to +\infty} \frac{1}{n} \Big( K_n \big( \log(K_n n^2) \big)^{\frac{d+1}{2}} \Big) = 0$.
The regret of LMC-UCB with parameter $\delta_n = \frac{1}{n^2}$ on this sequence of partitions, where for sequence
$(\Omega_k^n)_{n, k \le K_n}$ it disposes of $n$ points, is such that

$$\lim_{n \to \infty} n^{1+2/d} \, R_n(\mathcal{A}_{LMC-UCB}) = 0.$$

The proof of this result is in the Supplementary Material (Appendix D).

5.2 Under a slightly stronger Assumption

We introduce the following Assumption, that is to say that $f$ admits a Taylor expansion of order 2.

Assumption 2 $f$ admits a Taylor expansion at the second order in any point $a \in [0,1]^d$ and this
expansion is such that $\forall x$, $|f(x) - f(a) - \langle \nabla f(a), x - a \rangle| \le M \|x - a\|_2^2$ where $M$ is a constant.

This is a slightly stronger assumption than Assumption 1, since it imposes, in addition to
Assumption 1, that the variations of $\nabla f(x)$ are uniformly bounded for any $x \in [0,1]^d$. Assumption 2
implies Assumption 1 since $\big| \|\nabla f(x)\|_2 - \|\nabla f(0)\|_2 \big| \le M \|x - 0\|_2$, which implies that $\|\nabla f(x)\|_2 \le
\|\nabla f(0)\|_2 + M\sqrt{d}$. This implies in particular that we can consider $L = \|\nabla f(0)\|_2 + M\sqrt{d}$. We
however do not need $M$ to tune the algorithm LMC-UCB, as long as we have access to $L$ (although
$M$ appears in the bound of the next Theorem).

We can now prove a bound on the pseudo-regret. 

Theorem 2 Under Assumptions 1 and 2, if $n \ge 4K$, the estimate returned by algorithm LMC-UCB
is such that, with probability $1 - \delta$, we have

Rn{ALMC-ucB) [m{L + + {(y^M'''^ ^\og{K/5)K^^n-^^ + 25d(-^) ^tt)] . 



n 



A proof of this result is in the Supplementary Material (Appendix E).

Now we can choose optimally the number of strata so that we minimize the regret. 

Theorem 3 Under Assumptions 1 and 2, the algorithm LMC-UCB launched on $K_n = (\sqrt{n})^{\frac{d}{d+1}}$
hyper-cubic strata is such that, with probability $1 - \delta$, we have

RniALMC-ucB) < , [700Af(£ + 1)4^3/2(1 + ^y^iog(n/5) 

<i + 2(£i+l) L \ Zj / 
5.3 Discussion



Convergence of the algorithm LMC-UCB to the optimal oracle strategy: When the number
of strata $K_n$ grows to infinity, but such that $\lim_{n \to +\infty} \frac{1}{n} \Big( K_n \big( \log(K_n n^2) \big)^{\frac{d+1}{2}} \Big) = 0$, the
pseudo-regret of algorithm LMC-UCB converges to 0. It means that this strategy is asymptotically as
efficient as (the lower bound on) the optimal oracle strategy. When $f$ admits a Taylor expansion at the
first order in every point, it is also possible to obtain a finite-time bound on the pseudo-regret.

A new sampling scheme: The algorithm LMC-UCB samples the points in a way that takes
advantage of both stratified sampling and quasi Monte-Carlo. Indeed, LMC-UCB is designed to
combine (i) the advantages of quasi Monte-Carlo by spreading the samples in the domain and (ii)
the advantages of stratified, adaptive sampling by allocating more samples where the function has
larger variations. For these reasons, this technique is very efficient on differentiable functions. We
illustrate this assertion by numerical experiments in the Supplementary Material (Appendix A).

In high dimension: The bound on the pseudo-regret in Theorem 3 is of order
$n^{-(1+\frac{2}{d})} \times poly(d)\, n^{-\frac{1}{2(d+1)}}$. In order for the pseudo-regret to be negligible when compared to the
optimal oracle mean squared error of the estimate (which is of order $n^{-(1+\frac{2}{d})}$) it is necessary that
$poly(d)\, n^{-\frac{1}{2(d+1)}}$ is negligible compared to 1. In particular, this says that $n$ should scale
exponentially with the dimension $d$. This is unavoidable, since stratified sampling shrinks the approximation
error to the asymptotic oracle only if the diameter of each stratum is small, i.e. if the space is stratified
in every direction (and thus if $n$ is exponential with $d$). However, Uniform stratified Monte-Carlo,
for the same reasons, shares this problem.⁵

We emphasize however the fact that a (slightly modified) version of our algorithm is more efficient
than crude Monte-Carlo, up to a negligible term that depends only on $poly(\log(d))$. The bound in
Lemma 3 depends on $poly(d)$ only because of rounding issues, coming from the fact that we aim
at dividing each stratum $\Omega_k$ in hyper-cubic sub-strata. The whole budget is thus not completely
used, and only $\sum_k S_k + K\bar S$ samples are collected. By modifying LMC-UCB so that it allocates
the remaining budget uniformly at random on the domain, it is possible to prove that the (modified)
algorithm is always at least as efficient as crude Monte-Carlo.

Conclusion

This work provides an adaptive method for estimating the integral of a differentiable function $f$.
We first proposed a benchmark for measuring efficiency: we proved that the asymptotic mean
squared error of the estimate output by the optimal oracle strategy is lower bounded by $\Sigma \frac{1}{n^{1+2/d}}$.
We then proposed an algorithm called LMC-UCB, which manages to learn the amplitude of the
variations of $f$, to sample more points where these variations are larger, and to spread these points in a
way that is related to quasi Monte-Carlo sampling schemes. We proved that algorithm LMC-UCB
is asymptotically as efficient as the optimal oracle strategy. Under the assumption that $f$ admits a
Taylor expansion in each point, we also provide a finite-time bound for the pseudo-regret of
algorithm LMC-UCB. We summarize in Table 1 the rates and finite-time bounds for crude Monte-Carlo,
Uniform stratified Monte-Carlo and LMC-UCB. An interesting extension of this work would be to
adapt it to $\alpha$-Hölder functions that admit a Riemann-Liouville derivative of order $\alpha$. We believe
that similar results could be obtained, with an optimal constant and a rate of order $n^{-(1+2\alpha/d)}$.

Sampling scheme       | Pseudo-risk: rate      | Asymptotic constant                                                               | + Finite-time bound
Crude MC              | $1/n$                  | $\int_{[0,1]^d} \big( f(x) - \int_{[0,1]^d} f(u)\,du \big)^2 dx$                  | $+ 0$
Uniform stratified MC | $1/n^{1+\frac{2}{d}}$  | $\frac{1}{12} \int_{[0,1]^d} \|\nabla f(x)\|_2^2 dx$                              | $+ O\big( n^{-(1+\frac{2}{d}+\frac{1}{2d})} \big)$
LMC-UCB               | $1/n^{1+\frac{2}{d}}$  | $\Sigma = \frac{1}{12} \big( \int_{[0,1]^d} (\|\nabla f(x)\|_2)^{\frac{d}{d+1}} dx \big)^{2\frac{(d+1)}{d}}$ | $+ O\big( poly(d)\, n^{-(1+\frac{2}{d}+\frac{1}{2(d+1)})} \big)$

Table 1: Rate of convergence plus finite-time bounds for Crude Monte-Carlo, Uniform stratified
Monte Carlo (see Lemma 2) and LMC-UCB (see Theorems 1 and 3).

Acknowledgements This research was partially supported by Nord-Pas-de-Calais Regional Council,
French ANR EXPLO-RA (ANR-08-COSI-004), the European Community's Seventh Framework
Programme (FP7/2007-2013) under grant agreement 270327 (project CompLACS), and by Pascal-2.

⁵ When $d$ is very large and $n$ is not exponential in $d$, then second order terms, depending on the dimension,
take over the bound in Lemma 2 (which is an asymptotic bound) and $poly(d)$ appears in these negligible terms.






References 

[1] J.Y. Audibert, R. Munos, and Cs. Szepesvari. Exploration-exploitation tradeoff using variance
estimates in multi-armed bandits. Theoretical Computer Science, 410(19):1876-1902, 2009.

[2] A. Carpentier and R. Munos. Finite-time analysis of stratified sampling for Monte Carlo. In
Neural Information Processing Systems (NIPS), 2011a.

[3] A. Carpentier and R. Munos. Finite-time analysis of stratified sampling for Monte Carlo.
Technical report, INRIA-00636924, 2011b.

[4] Pierre Etore and Benjamin Jourdain. Adaptive optimal allocation in stratified sampling methods.
Methodol. Comput. Appl. Probab., 12(3):335-360, September 2010.

[5] P. Glasserman. Monte Carlo methods in financial engineering. Springer Verlag, 2004. ISBN
0387004513.

[6] V. Grover. Active learning and its application to heteroscedastic problems. Department of
Computing Science, Univ. of Alberta, MSc thesis, 2009.

[7] A. Maurer and M. Pontil. Empirical Bernstein bounds and sample-variance penalization. In
Proceedings of the Twenty-Second Annual Conference on Learning Theory, pages 115-124,
2009.

[8] H. Niederreiter. Quasi-Monte Carlo methods and pseudo-random numbers. Bull. Amer. Math.
Soc., 84(6):957-1041, 1978.

[9] R.Y. Rubinstein and D.P. Kroese. Simulation and the Monte Carlo method. Wiley-Interscience,
2008. ISBN 0470177942.






Supplementary Material for paper 

Adaptive Stratified Sampling for Monte-Carlo 

integration of Differentiable functions 

A Numerical Experiments 

We provide some experiments illustrating how LMC-UCB works, and compare its efficiency to that
of crude Monte-Carlo and Uniform stratified Monte-Carlo.

We first illustrate the sampling scheme on an example, in Figure 2. We have launched LMC-UCB on
the function displayed in Figure 2 (i.e. $f(x) = \sin(1/(x + 0.1)) + \mathbb{I}\{x > 0.9\} \sin(1/(x - 0.7))$). We
chose this function since its variations are quite heterogeneous in the domain $[0,1]$. We considered
a budget of $n = 100$, and took as parameter $A = 10$. $K_n$ and $\bar S$ are defined as in Figure 1.



[Figure: position of the samples collected by LMC-UCB (for n = 100); horizontal axis: x.]

Figure 2: Position of the samples collected by LMC-UCB.



We observe that, as expected, the algorithm allocates more points in parts of the domain where the 
function has larger variations and, additional to that, it spreads the points on the domain so that every 
region is covered (in a similar spirit to what low-discrepancy schemes would do). 

We also compare, for this function, the mean squared error of crude Monte-Carlo, uniform stratified 
Monte-Carlo and LMC-UCB, for different values of $n$. We average the squared error of the 
estimate returned by each method over 10000 runs. The performance of each method is displayed 
in Figures 3 and 4. 



As expected, the mean squared error decreases faster than $1/n$ for uniform stratified Monte-Carlo and 
LMC-UCB. These methods are also more efficient than crude Monte-Carlo (up to 100 times more 
efficient on this function), which makes sense since the function that we integrate is differentiable 
(the rate for LMC-UCB and uniform stratified Monte-Carlo is then of order $O(n^{-1-2/d})$). The 
gain in efficiency when compared to crude Monte-Carlo however decreases with the dimension, as 
explained in Subsection 5.3. We observe that LMC-UCB is more efficient than uniform stratified 
Monte-Carlo, which is a minimax-optimal strategy in the class of non-adaptive strategies. 
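The comparison between the two non-adaptive baselines can be sketched as follows. This is not the code used for the figures and it does not implement LMC-UCB; it only contrasts crude Monte-Carlo with uniform stratified Monte-Carlo (one sample in each of $n$ equal strata) on the test function. The number of runs, the budgets, the seed and the grid used for the reference integral are assumptions made for illustration.

```python
import math
import random


def f(x: float) -> float:
    """Test integrand, as read from the text."""
    value = math.sin(1.0 / (x + 0.1))
    if x > 0.9:
        value += math.sin(1.0 / (x - 0.7))
    return value


def crude_mc(n: int, rng: random.Random) -> float:
    """Average of f at n points drawn uniformly on [0, 1]."""
    return sum(f(rng.random()) for _ in range(n)) / n


def uniform_stratified_mc(n: int, rng: random.Random) -> float:
    """One uniform sample in each of the n strata [k/n, (k+1)/n)."""
    return sum(f((k + rng.random()) / n) for k in range(n)) / n


def mse(estimator, n: int, reference: float, runs: int, rng: random.Random) -> float:
    """Empirical mean squared error of an estimator over independent runs."""
    return sum((estimator(n, rng) - reference) ** 2 for _ in range(runs)) / runs


# Reference value of the integral from a fine midpoint rule (assumed accurate enough).
grid = 200_000
reference = sum(f((j + 0.5) / grid) for j in range(grid)) / grid

rng = random.Random(0)
runs = 400
results = {}
for n in (20, 80):
    results[n] = (
        mse(crude_mc, n, reference, runs, rng),
        mse(uniform_stratified_mc, n, reference, runs, rng),
    )
    print(n, results[n])
```

Multiplying the budget by 4 divides the crude error by roughly 4, while the stratified error should drop much faster, in line with the $O(n^{-1-2/d})$ rate with $d = 1$.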



B Proof of Lemma 1 

Step 0: Decomposition of the variance. Let $\Omega = (\Omega_k^n)_{0 < n < +\infty,\, k \le n}$ be a sequence of partitions 
of $[0, 1]^d$ in $n$ hyper-cubic strata such that the maximum diameter of the strata in the partitions 
converges to 0 when $n$ goes to infinity. In each of those strata, there is a point. 

[Plot legend: crude MC, uniform stratified MC, LMC-UCB.] 

Figure 3: Mean squared error w.r.t. the integral of $f$ of crude Monte-Carlo, uniform stratified Monte-Carlo and LMC-UCB, as a function of the budget $n$. Since crude Monte-Carlo is approximately 100 times less efficient than the two other strategies, their curves are shrunk and not very visible. 

Figure 4: Zoom on the mean squared error w.r.t. the integral of $f$ of uniform stratified Monte-Carlo and LMC-UCB, as a function of the budget $n$. 

Let $n$ be the number of points, and $k \le n$ be an index. Let $a_{n,k}$ be a point of the stratum $\Omega_k^n$. Let 
us assume that $f$ is differentiable, that its derivative $\nabla f$ is continuous, and let us also assume that 
$\|\nabla f(x)\|_2^2 = \sum_{i=1}^d \big(\frac{\partial f}{\partial x_i}(x)\big)^2$ is such that $\int_{[0,1]^d} \|\nabla f(x)\|_2^2 dx$ is bounded. In that case, $\forall x \in \Omega_k^n$, there 
exists $u_{n,k,x} \in \Omega_k^n$ such that we have $f(x) - f(a_{n,k}) = \langle \nabla f(u_{n,k,x}), x - a_{n,k} \rangle$ (mean value 
theorem). Note also that we have in that case $\mu_{n,k} = f(a_{n,k}) + \frac{1}{w_{n,k}} \int_{\Omega_k^n} \langle \nabla f(u_{n,k,x}), x - a_{n,k} \rangle dx$, 
where $a_{n,k}$ is the center of the stratum $\Omega_k^n$. We thus have: 



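$$
\begin{aligned}
\sigma_{n,k}^2 &= \frac{1}{w_{n,k}} \int_{\Omega_k^n} \big(f(x) - \mu_{n,k}\big)^2 dx \\
&= \frac{1}{w_{n,k}} \int_{\Omega_k^n} \Big( \langle \nabla f(u_{n,k,x}), x - a_{n,k} \rangle - \frac{1}{w_{n,k}} \int_{\Omega_k^n} \langle \nabla f(u_{n,k,y}), y - a_{n,k} \rangle dy \Big)^2 dx \\
&= \frac{1}{w_{n,k}} \int_{\Omega_k^n} \langle \nabla f(u_{n,k,x}), x - a_{n,k} \rangle^2 dx - \Big( \frac{1}{w_{n,k}} \int_{\Omega_k^n} \langle \nabla f(u_{n,k,y}), y - a_{n,k} \rangle dy \Big)^2 \\
&= \frac{1}{w_{n,k}} \int_{[0,1]^d} \langle \nabla f(u_{n,k,x}) \mathbb{1}\{\Omega_k^n\}(x), (x - a_{n,k}) \mathbb{1}\{\Omega_k^n\}(x) \rangle^2 dx \\
&\quad - \Big( \frac{1}{w_{n,k}} \int_{[0,1]^d} \langle \nabla f(u_{n,k,y}) \mathbb{1}\{\Omega_k^n\}(y), (y - a_{n,k}) \mathbb{1}\{\Omega_k^n\}(y) \rangle dy \Big)^2.
\end{aligned}
$$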



Step 1: Convergence of $\sigma_k$ when the size of the strata goes to 0. Let $x \in [0, 1]^d$. Note that as 
$(\Omega_k^n)_{k \le n}$ is a partition, there is a $k_{n,x}$ such that $x \in \Omega^n_{k_{n,x}}$. 

Note first that $\nabla f$ is continuous. This means that $\forall \epsilon$, $\exists \eta$ such that $\forall y \in B_2(x, \eta)$, $\|\nabla f(y) - \nabla f(x)\|_2 \le \epsilon$. 
Let $\epsilon > 0$. For $n$ sufficiently large (any $n$ larger than some given horizon $n'$), the maximum diameter 
of $\Omega^n_{k_{n,x}}$ is smaller than $\eta$. Let $y \in \Omega^n_{k_{n,x}}$. As $u_{n,k_{n,x},y} \in \Omega^n_{k_{n,x}}$, we know that $\|u_{n,k_{n,x},y} - x\|_2 \le \eta$ 
and that we thus have $\|\nabla f(u_{n,k_{n,x},y}) - \nabla f(x)\|_2 \le \epsilon$. This means that $\nabla f(u_{n,k_{n,x},y})$ converges 
point-wise to $\nabla f(x)$. 

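Note also that we have by Cauchy-Schwarz that 

$$\frac{1}{w_{n,k_{n,x}}^{2/d}} \langle \nabla f(u_{n,k_{n,x},y}), y - a_{n,k_{n,x}} \rangle^2 \le \frac{\|\nabla f(u_{n,k_{n,x},y})\|_2^2 \, \|y - a_{n,k_{n,x}}\|_2^2}{w_{n,k_{n,x}}^{2/d}} \le d \, \|\nabla f(u_{n,k_{n,x},y})\|_2^2 \le d L^2,$$

with $L$ a bound on $\|\nabla f\|_2$. 

As $\nabla f(u_{n,k_{n,x},y})$ converges point-wise with $n$ to $\nabla f(x)$, and as the left-hand side of the previous display is bounded, we have by the Theorem of Dominated convergence that 

$$\lim_{n \to +\infty} \frac{1}{w_{n,k_{n,x}}^{1+2/d}} \int_{[0,1]^d} \langle \nabla f(u_{n,k_{n,x},y}), y - a_{n,k_{n,x}} \rangle^2 \, \mathbb{1}\{\Omega^n_{k_{n,x}}\}(y) \, dy = \frac{\|\nabla f(x)\|_2^2}{12}.$$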

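In the same way, we have that 

$$\lim_{n \to +\infty} \frac{1}{w_{n,k_{n,x}}^{1+1/d}} \int_{[0,1]^d} \langle \nabla f(u_{n,k_{n,x},y}), y - a_{n,k_{n,x}} \rangle \, \mathbb{1}\{\Omega^n_{k_{n,x}}\}(y) \, dy = 0.$$

Let us call $g_{n,\Omega}(x) = \sum_{k=1}^n \frac{\sigma_{n,k}^2}{w_{n,k}^{2/d}} \mathbb{1}\{\Omega_k^n\}(x)$. The last two limits prove, $\forall x$, 
point-wise convergence of $g_{n,\Omega}(x)$ to $\frac{\|\nabla f(x)\|_2^2}{12}$. 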
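The limit used in this step, namely that the variance of $f$ on a small hyper-cube of measure $w$ centred at $a$ behaves like $w^{2/d} \|\nabla f(a)\|_2^2 / 12$, can be checked numerically. The sketch below is only an illustration: the function `f`, the centre `a` and the midpoint-grid approximation of the variance are assumptions chosen for the example, in dimension $d = 2$.

```python
import math


def f(x: float, y: float) -> float:
    """Hypothetical smooth test function on [0, 1]^2."""
    return math.sin(3.0 * x) + x * y


def grad_norm_sq(x: float, y: float) -> float:
    """Squared Euclidean norm of the gradient of f."""
    gx = 3.0 * math.cos(3.0 * x) + y
    gy = x
    return gx * gx + gy * gy


def box_variance(ax: float, ay: float, h: float, m: int = 100) -> float:
    """Variance of f(U), U uniform on the square of side h centred at (ax, ay),
    approximated on an m x m midpoint grid."""
    values = [
        f(ax + ((i + 0.5) / m - 0.5) * h, ay + ((j + 0.5) / m - 0.5) * h)
        for i in range(m)
        for j in range(m)
    ]
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


d = 2
a = (0.3, 0.6)
limit = grad_norm_sq(*a) / 12.0

ratios = []
for h in (0.1, 0.01, 0.001):
    w = h ** d  # measure of the box
    ratios.append(box_variance(a[0], a[1], h) / w ** (2.0 / d) / limit)
print(ratios)  # each ratio should be close to 1
```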

Step 2: Optimal allocation and minimum for the asymptotic variance. There is one point pulled 
at random per stratum. The variance of the estimate given by such an allocation is 

$$\sum_{k=1}^n w_{n,k}^2 \sigma_{n,k}^2 = \sum_{k=1}^n w_{n,k}^{2+2/d} \, \frac{\sigma_{n,k}^2}{w_{n,k}^{2/d}}.$$

Define $s_{n,\Omega}(x) = \sum_{k=1}^n \frac{1}{n w_{n,k}} \mathbb{1}\{\Omega_k^n\}(x)$. Note first that 

$$\int_{[0,1]^d} s_{n,\Omega}(x) dx = \sum_{k=1}^n \frac{1}{n} = 1,$$

and that $s_{n,\Omega}(x) > 0$. 

One has also for the variance of the estimate that 

$$\sum_{k=1}^n w_{n,k}^2 \sigma_{n,k}^2 = \frac{1}{n^{1+2/d}} \int_{[0,1]^d} \frac{g_{n,\Omega}(x)}{s_{n,\Omega}(x)^{1+2/d}} dx.$$

By using the result of the previous step, one has (for every sequence of partitions where the diameter of the 
strata converges uniformly to 0) point-wise convergence of $g_{n,\Omega}(x)$ to $\frac{\|\nabla f(x)\|_2^2}{12}$ when $n$ goes to 
infinity. 

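This leads to, by using Fatou's Lemma, 

$$\liminf_{n \to +\infty} \int_{[0,1]^d} \frac{g_{n,\Omega}(x)}{s_{n,\Omega}(x)^{1+2/d}} dx \ge \inf_{s > 0, \, \int s = 1} \int_{[0,1]^d} \frac{\|\nabla f(x)\|_2^2}{12} \, \frac{1}{s(x)^{1+2/d}} dx.$$

One then wants to find the function $s$ that minimizes this limit, i.e. to solve the program 
$\inf_s \int_{[0,1]^d} \frac{\|\nabla f(x)\|_2^2}{12 \, s(x)^{1+2/d}} dx$ such that $s > 0$ and $\int_{[0,1]^d} s(x) dx = 1$. 

The solution (obtained by writing the Lagrangian) is 

$$s(x) = \frac{\|\nabla f(x)\|_2^{d/(d+1)}}{\int_{[0,1]^d} \|\nabla f(u)\|_2^{d/(d+1)} du}.$$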



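By plugging it in the bound, one obtains 

$$\liminf_{n \to +\infty} \int_{[0,1]^d} \frac{g_{n,\Omega}(x)}{s_{n,\Omega}(x)^{1+2/d}} dx \ge \frac{1}{12} \Big( \int_{[0,1]^d} \|\nabla f(x)\|_2^{\frac{d}{d+1}} dx \Big)^{\frac{2(d+1)}{d}}.$$

Note that the previous result holds for any sequence of partitions where the diameter of each 
stratum converges uniformly to 0. One finally has that the minimum possible asymptotic 
variance is bounded as 

$$\liminf_{n \to +\infty} n^{1+2/d} \sum_{k=1}^n w_{n,k}^2 \sigma_{n,k}^2 \ge \frac{1}{12} \Big( \int_{[0,1]^d} \|\nabla f(x)\|_2^{\frac{d}{d+1}} dx \Big)^{\frac{2(d+1)}{d}},$$

and we thus obtain the desired result. 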
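The optimisation program of Step 2 can be checked on a discretised domain. The sketch below assumes the minimiser has the form $s \propto c^{d/(2(d+1))}$ with $c(x) = \|\nabla f(x)\|_2^2 / 12$, which is what the first-order Lagrangian condition gives for the objective $\int c(x) / s(x)^{1+2/d} dx$ under the constraint $\int s = 1$; the random values of `c`, the number of cells and the function names are illustrative assumptions. It compares the objective at this candidate with the objective at randomly drawn feasible densities, and with the closed form $\big(\int c^{d/(2(d+1))}\big)^{2(d+1)/d}$.

```python
import random


def objective(c, s, d):
    """Discretised version of the integral of c(x) / s(x)^(1 + 2/d) over m equal cells."""
    p = 1.0 + 2.0 / d
    return sum(cj / sj ** p for cj, sj in zip(c, s)) / len(c)


def optimal_density(c, d):
    """Candidate minimiser: s proportional to c^(d / (2 (d + 1))), normalised to integrate to 1."""
    q = d / (2.0 * (d + 1.0))
    weights = [cj ** q for cj in c]
    z = sum(weights) / len(weights)  # discretised integral of c^q
    return [wj / z for wj in weights], z


rng = random.Random(1)
d = 2
m = 50
# Hypothetical cell values standing for ||grad f||_2^2 / 12.
c = [rng.uniform(0.1, 5.0) for _ in range(m)]

s_star, z = optimal_density(c, d)
best = objective(c, s_star, d)
closed_form = z ** (2.0 * (d + 1.0) / d)

others = []
for _ in range(200):
    raw = [rng.uniform(0.2, 3.0) for _ in range(m)]
    mean = sum(raw) / m
    s = [r / mean for r in raw]  # feasible: positive and integrates to 1
    others.append(objective(c, s, d))

print(best, closed_form, min(others))
```

Since the objective is convex in $s$ and the constraint is linear, no feasible density should achieve a smaller value than the candidate.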



C Proof of Lemma 3 

Upper bound on the standard deviation: The upper confidence bound $B_{k,t}$ used in the MC-UCB 
algorithm is an elaboration, in the specific case of Lipschitz functions, on Theorem 10 in [7] (a 
variant of this result is also reported in [1]). We state here a main Lemma. 



Lemma 4 Assume that the function $f$ from which the data is collected is differentiable, that 
$\|\nabla f(x)\|_2$ is bounded by $L$, and that $n \ge 2$. Define the following event 



n 



l<k<K. 



\ 



1 



^3 = 1 



^fc^i/,^/ log(2i^/^) 
S 

(9) 



The probability of $\xi$ is bounded below by $1 - \delta$. 

Note that the first term in the absolute value in Equation 9 is the empirical standard deviation of arm 
$k$ computed as in Equation 8 for $t$ samples. The event $\xi$ plays an important role in the proofs of this 
section and a number of statements will be proved on this event. 
We now provide the proof of Lemma 4. 

Let us assume that $f$ is such that $\|\nabla f\|_2 \le L$. Let us consider a small box $\Omega_w$ of measure $w$ and such 
that $\Omega_w = \prod_{i=1}^d [a_i - \frac{w^{1/d}}{2}, a_i + \frac{w^{1/d}}{2}]$. As $\|\nabla f\|_2 \le L$, we know that $|f(x) - \frac{1}{w} \int_{\Omega_w} f(u) du| \le L \sqrt{d} \, w^{1/d}$. 

If $U$ is a random variable on $\Omega_w$, and $X = f(U)$, then 

$$|X - \mu| \le L \sqrt{d} \, w^{1/d},$$

where $\mu = \frac{1}{w} \int_{\Omega_w} f(u) du$. 

Note first that for algorithm LMC-UCB, the $S$ first samples are each sampled in a hypercube of 
measure $\frac{w_k}{S}$, and all of those hypercubes form a partition of the domain. 
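The deviation bound $|X - \mu| \le L \sqrt{d} \, w^{1/d}$ follows from the fact that a hyper-cube of measure $w$ has diameter $\sqrt{d} \, w^{1/d}$. A minimal numerical illustration is given below; the function `f`, its gradient bound `L` and the boxes are assumptions chosen for the example, and the mean on each box is approximated on a midpoint grid.

```python
import math


def f(x: float, y: float) -> float:
    """Hypothetical test function with a bounded gradient."""
    return math.sin(2.0 * x) + math.cos(y)


# ||grad f||_2 = sqrt(4 cos^2(2x) + sin^2(y)) <= sqrt(5)
L = math.sqrt(5.0)
d = 2


def max_deviation_from_mean(ax: float, ay: float, h: float, m: int = 60) -> float:
    """Largest |f(x) - mean of f| over an m x m midpoint grid of the square
    of side h centred at (ax, ay)."""
    values = [
        f(ax + ((i + 0.5) / m - 0.5) * h, ay + ((j + 0.5) / m - 0.5) * h)
        for i in range(m)
        for j in range(m)
    ]
    mean = sum(values) / len(values)
    return max(abs(v - mean) for v in values)


checks = []
for ax, ay, h in [(0.2, 0.7, 0.5), (0.5, 0.5, 0.1), (0.9, 0.1, 0.02)]:
    w = h ** d  # measure of the box
    bound = L * math.sqrt(d) * w ** (1.0 / d)
    checks.append((max_deviation_from_mean(ax, ay, h), bound))
print(checks)
```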

Using a large deviation bound on the variance, e.g. the one in [7], we can deduce that with probability 
$1 - 2\delta$, 



\ i=i j=i 



^ 21og(l/(5) 
5-1 ' 



where $b$ is a bound on the random variables $X_i - \mu_i$. Because $|X_{k,i} - \mu_{k,i}| \le \sqrt{d} L (\frac{w_k}{S})^{1/d}$ 
(where $\mu_{k,i}$ is the mean of the function on the hypercube where point $X_{k,i}$ is sampled) and because 
$t \ge 2$, one gets 



\ i=l j=l 



Then by doing a simple union bound on (k, t), we obtain the result. 
The following Corollary holds. 

Corollary 1 On the event $\xi$, $\forall k \le K$, 



l/d 

\(Tk,KS -^k\< 2LVd^\ogi2K/S)^ 

S 2d 



By concavity, we also have the following Corollary. 
Corollary 2 On the event $\xi$, for all $k \le K$, 



d d oi," 

I ~ d + l _ ^<i+l I < 4 



_ d+2 1 



w/iere A = {2L^/d^\og{2K / 5)y 



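The number of sub-strata. Let $k$ be an index. Let us call 

$$C_k = \frac{(w_k \hat{\sigma}_{k,KS})^{\frac{d}{d+1}}}{\sum_{i=1}^K (w_i \hat{\sigma}_{i,KS})^{\frac{d}{d+1}}} (n - KS).$$

Stratum $\Omega_k$ is subdivided in $S_k = \max\big[S, \lfloor C_k^{1/d} \rfloor^d\big]$ sub-strata, composing the sub-partition $\mathcal{N}_k$. 

Note first that $\sum_{k=1}^K S_k \le n$, as $\sum_{k=1}^K C_k = n - KS$. As the samples are always picked in the sub-strata 
that have the fewest points, this ensures that there is at least one point per sub-stratum. 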
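The budget argument for the sub-strata can be illustrated with a short sketch. It assumes that the number of sub-strata has the form $S_k = \max[S, \lfloor C_k^{1/d} \rfloor^d]$, where $C_k$ is the share of the remaining budget $n - KS$ proportional to $(w_k \hat{\sigma}_k)^{d/(d+1)}$; this form is a reading of the damaged text and should be checked against the algorithm's definition. The function name `substrata_counts` and the numerical values are hypothetical.

```python
import math
import random


def substrata_counts(sigma_hat, weights, n, S, d):
    """Number of sub-strata per stratum under the assumed rule
    S_k = max(S, floor(C_k^(1/d))^d), where the C_k sum to n - K*S."""
    K = len(sigma_hat)
    p = d / (d + 1.0)
    scores = [(w * s) ** p for w, s in zip(weights, sigma_hat)]
    total = sum(scores)
    counts = []
    for score in scores:
        c_k = score / total * (n - K * S)
        side = math.floor(c_k ** (1.0 / d))  # sub-strata per axis
        counts.append(max(S, side ** d))
    return counts


rng = random.Random(2)
K, S, d, n = 8, 4, 2, 500
weights = [1.0 / K] * K  # strata of equal measure
sigma_hat = [rng.uniform(0.05, 2.0) for _ in range(K)]  # made-up estimates

counts = substrata_counts(sigma_hat, weights, n, S, d)
print(counts, sum(counts))
```

Because $S_k \le S + C_k$ and the $C_k$ sum to $n - KS$, the total number of sub-strata cannot exceed the budget $n$, so one sample per sub-stratum is always affordable.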
On $\xi$, we have because of Corollary 2 that 



\ S2(d+1) / 



-{n-KS) 



> 



J (n-KS) 



+ 2 



2yl 



> 



KS 



2An 



- d+2 
J]^5'2(d+l) 



Using the fact that (f ) "+1 > > (^(f ) "+1 - ij > 

2An ^x,^ 



_n ^ d+1 
K ) 



in the last Equation, 



/ n d 9 An K d „ d+2 , d+2 s 



1 + 

2An^ (^+1)^ ''(''+2) 



> Aft-jfe n - ii:<i+in<i+i if 2(d+i)^ (1 + W(_)d+i 2(d+i) ) 



I A K d + 2 1 ^ \ 

> A^fc (1 + 2-; h d( — ) "(''+1'" )K^n^ 



where the last line comes from the fact that n > K. 
We also have 



Ck - \cl'^Y' < Ck - {c'J'' - If = Cfe(i - (1 



1/d 



(10) 



From the last Equation, the definition of Sk and Equation 10 we deduce that (rounding issues) 

d 



Sk > max 

> max 

> max 

> max 



S,Ck{l 



1/d' 



S,Ck{l 



5, A 



K. 



I A K d+2 1 d \ / 1 \ 

k(n~{l + 2— + d{ — )^(^)K^n^ ) (l - d{—) ) 



A 



S, Aiffcfn- (2 + 2-^ +d)if3TTn3TT 



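We call $N = n - \big(2 + \frac{2A}{\Sigma_K} + d\big) K^{\frac{1}{d+1}} n^{\frac{d}{d+1}}$ in the sequel. Note that $\forall k$, we have $S_k \ge \max[S, \lambda_{K,k} N]$. 

Note also that for $\delta < 1$, we have 

$$A \le \big(2 L \sqrt{d} \sqrt{\log(2K/\delta)}\big)^{\frac{d}{d+1}} \le 4 (L + 1) \sqrt{d} \sqrt{\log(K/\delta)}.$$

We thus have that 

$$n \ge N \ge n - 7 (L + 1) d^{3/2} \sqrt{\log(K/\delta)} \Big(1 + \frac{1}{\Sigma_K}\Big) K^{\frac{1}{d+1}} n^{\frac{d}{d+1}}. \quad (11)$$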



D Proof of Theorem 1 

Step 1: Notations. Let $((\Omega_k^n)_{k \le K_n})_n$ be a sequence of partitions in hyper-cubic strata of same 
measure. Let us also assume that the number of strata $K_n$ in partition $(\Omega_k^n)_k$ is such that 
$\lim_{n \to +\infty} K_n = +\infty$ and $\lim_{n \to +\infty} \sqrt{\log(K_n n^2)} \, (K_n / n)^{\frac{1}{d+1}} = 0$. On each of those partitions, MC-UCB 
is launched with respectively $n$ samples and parameter $\delta_n = \frac{1}{n^2}$. 

The number of hyper-cubic sub-strata built by the algorithm in stratum $\Omega_k^n$ is $S_{n,k}$. Let us write 
$((\Omega^n_{k,s})_{s \le S_{n,k}})_{k \le K_n}$ for the partition in hyper-cubic strata formed with those sub-strata. By 
construction of the algorithm, there is at least one point per sub-stratum. The estimate of the mean of 
the function is built with the first point in each of those sub-strata. 

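Let us write $g_n(x) = \sum_{k=1}^{K_n} \sum_{s=1}^{S_{n,k}} \frac{\sigma_{n,k,s}^2}{w_{n,k,s}^{2/d}} \mathbb{1}\{\Omega^n_{k,s}\}(x)$. 
From Step 1 of the proof of Lemma 1, it converges with $n$ (because $K_n \to +\infty$ when $n \to \infty$ and 
thus the diameter of each stratum goes to 0) point-wise to $\frac{\|\nabla f(x)\|_2^2}{12}$. 

Let us write $\tilde{g}_n(x) = \sum_{k=1}^{K_n} \frac{\sigma_{n,k}^{d/(d+1)}}{w_{n,k}^{1/(d+1)}} \mathbb{1}\{\Omega^n_k\}(x)$. From Step 1 of the proof of Lemma 1, it 
converges with $n$ point-wise to $\big(\frac{\|\nabla f(x)\|_2}{2\sqrt{3}}\big)^{\frac{d}{d+1}}$. This convergence implies, as $\|\nabla f\|_2$ is bounded and thus 
as $\int \|\nabla f\|_2^{\frac{d}{d+1}}$ is bounded, by the Theorem of Dominated convergence that $\lim_{n \to +\infty} \Sigma_{K_n} = \int_{[0,1]^d} \big(\frac{\|\nabla f(x)\|_2}{2\sqrt{3}}\big)^{\frac{d}{d+1}} dx$. 

Define $\lambda_n(x) = \sum_{k=1}^{K_n} \frac{\lambda_{K_n,k}}{w_{n,k}} \mathbb{1}\{\Omega^n_k\}(x) = \frac{\tilde{g}_n(x)}{\Sigma_{K_n}}$. We thus 
know, as the limit of $(\Sigma_{K_n})_n$ exists and is bigger than 0, that $\lambda_n(x)$ converges pointwise to $s(x) = \frac{\|\nabla f(x)\|_2^{d/(d+1)}}{\int_{[0,1]^d} \|\nabla f(u)\|_2^{d/(d+1)} du}$. 

Let us also define $s_n(x) = \sum_{k=1}^{K_n} \frac{S_{n,k}}{n \, w_{n,k}} \mathbb{1}\{\Omega^n_k\}(x)$. 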



Step 1: Majoration of $1/s_n$. Let us consider only functions $f$ that are not everywhere constant 
on the domain, as otherwise the bound on the pseudo-risk is trivial*. Then $\exists X \subset [0, 1]^d$ such 
that $X$ is measurable and such that $\int_X 1 > 0$, and such that $\forall x \in X$, $\|\nabla f(x)\|_2 > 0$. Then 
$\int_{[0,1]^d} \big(\frac{\|\nabla f(x)\|_2}{2\sqrt{3}}\big)^{\frac{d}{d+1}} dx > 0$. 

Let $N_n$ be defined as in the proof of Lemma 3, i.e. $N_n$ as in Equation 11. As 



lim„^+oo = Jjjj j^jd(J^^^^^iLL2 ) (''+1) c?x, we know that for any ri sufficiently large, lim„ Sx„ > 
I /[o.i]^(^^^T^)^'^^- We thus have 

n>Nn>n-7{L + l)d^/^ ^\og{Kn/5n)il + -^)K^n^ 

>n- Cy/\og{Knn^)K^n^, 
with C < +00 as /jg j]d(-^^^4^^^) > 0. As by definition of the sequence of partitions, 

lim„^+oo Vlog(if„n2)(^) ^ = 0, we know that lim„^+oo ^ = 1- 
By Lemma[3] with probability 1 — (5„, Vfc, Sn.k > XK„,kNn- We thus have 

1 1 I , n ,\ ^ 



S,iix) Xn{x) Xn{x) Nn 



*If the function is everywhere constant, the samples are always equal to the integral, and the pseudo-risk of 
the estimate is zero. 
which leads to 



1 ^ 1 n 



Let A'+ = {x e [0, 1]'' : ||V/||2 > 0}. By the last Equation, Ve > 0, Vx G X+, for n sufficiently 
large (3n' such that Vn > n'), F{^-^>e) <6n. Note that ^+^1 S„ = EnZ h < 
We can thus use Borel-Cantelli's Theorem and this gives us that on , lim sup„ ~ A"11r) — 
a.s.. 

We thus deduce (i) by the definition of A„ and the fact that it converges almost surely to s and (ii) by 
the fact that lim„ ^ = 1, that limsup„ \ \x) — Tlx) ^-^^ (since, by definition, s„(x) > ^ > 
0). 

From that we deduce that Vx G X~^, limsup„ i-j^ < a.s.. As on [0, 1]'' — X~^, s{x) = 0, we 
have Va; G [0, 1]'', that limsup„ < a.s.. 

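Step 2: Convergence rate of the pseudo-risk. The pseudo-risk of the estimate $\hat{\mu}_n$ satisfies 

$$n^{1+2/d} \sum_{k=1}^{K_n} \sum_{s=1}^{S_{n,k}} w_{n,k,s}^2 \sigma_{n,k,s}^2 = \int_{[0,1]^d} \frac{g_n(x)}{s_n(x)^{1+2/d}} dx.$$

On $[0, 1]^d$, $g_n$ converges pointwise to $\frac{\|\nabla f\|_2^2}{12}$, and $\limsup_{n \to +\infty} \frac{1}{s_n(x)^{1+2/d}} \le \frac{1}{s(x)^{1+2/d}}$ a.s. We 
finally have by Fatou's Lemma that 

$$\limsup_{n \to +\infty} n^{1+2/d} \sum_{k=1}^{K_n} \sum_{s=1}^{S_{n,k}} w_{n,k,s}^2 \sigma_{n,k,s}^2 \le \int_{[0,1]^d} \limsup_n g_n(x) \, \limsup_n \frac{1}{s_n(x)^{1+2/d}} dx \le \int_{[0,1]^d} \frac{\|\nabla f(x)\|_2^2}{12} \, \frac{1}{s(x)^{1+2/d}} dx.$$

By plugging in the last Equation the definition of $s$, we conclude the proof. 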

E Proof of Theorem 2 

Step 0: Some inequalities when the second derivative of $f$ is bounded. Let $a$ be a point in $\Omega$. 

$f$ admits a Taylor expansion at any point. For any $x \in \Omega$, we have $|f(x) - f(a) - \langle \nabla f(a), x - a \rangle| \le 
M \|x - a\|_2^2$, with $2M$ a bound on the second derivative of $f$. 
Note also that $\|\nabla f(x) - \nabla f(a)\|_2 \le M \|x - a\|_2$. 



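Note also that 

$$
\begin{aligned}
\|\nabla f(x)\|_2^2 - \|\nabla f(a)\|_2^2 &\le \big(\|\nabla f(a)\|_2 + M \|x - a\|_2\big)^2 - \|\nabla f(a)\|_2^2 \\
&\le \|\nabla f(a)\|_2^2 + 2M \|\nabla f(a)\|_2 \|x - a\|_2 + M^2 \|x - a\|_2^2 - \|\nabla f(a)\|_2^2 \\
&\le 2M \|\nabla f(a)\|_2 \|x - a\|_2 + M^2 \|x - a\|_2^2.
\end{aligned}
$$

This means that 

$$\big| \|\nabla f(x)\|_2 - \|\nabla f(a)\|_2 \big| \le M \|x - a\|_2. \quad (12)$$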
Step 1: Variance on a small box. Let us place ourselves on one small box of measure $w$ such that the 
corresponding domain is $\Omega_w = \prod_{i=1}^d [a_i - \frac{w^{1/d}}{2}, a_i + \frac{w^{1/d}}{2}]$. We use a Taylor expansion in $a$ and 
have 

$$|f(x) - f(a) - \langle \nabla f(a), x - a \rangle| \le M \|x - a\|_2^2, \quad (13)$$

with $2M$ a bound on the second derivative of $f$. 
Note that because of the previous equation 

I- / (.f{u)- f{a) + ^f{a){u~a))du\<- [ \f{u)-f{a)+^f{a){u~a)\dt 



< Ml la; - a\ 



This implies because Oi = j"^' ^^^^ uduthat 



f{u)du- f{a)\ < M\\x-a\ 



Finally, by combining Equations [13] and [14] we get 

w Jn 

Triangle inequality on the last Equation leads to 

1 

w Ja 

This means by integrating that 

, 1 



fiu)du + Vf{a){x-a)\ < 2M\\x - a\ 



\f(x) - ^ / f{u)du\ < \Wf{a)(x - a)\ + 2M\\x - a\\l. 



(fix) - - [ f{u)du) 



dx < 



< 



J2„ 

-2M 
■4M2 



(|V/(a)(x - a)\ + 2M\\x - a\\iy dx 
\ 2 

V/(a)(a; — a) ) dx 



(v f{a){x — a)\j\\x — a\\2dx 
\\x — aWidx. 



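Note first that because $a_i = \frac{1}{w} \int_{\Omega_w} u_i \, du$, we have for the term in Equation 15 

$$
\begin{aligned}
\int_{\Omega_w} \langle \nabla f(a), x - a \rangle^2 dx &= \int_{\Omega_w} \Big( \sum_{i=1}^d \partial_i f(a) (x_i - a_i) \Big)^2 dx \\
&= \sum_{i=1}^d w^{1 - 1/d} \int_{a_i - w^{1/d}/2}^{a_i + w^{1/d}/2} \partial_i f(a)^2 (x_i - a_i)^2 dx_i \\
&= \frac{w^{1+2/d}}{12} \|\nabla f(a)\|_2^2. \quad (18)
\end{aligned}
$$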



Now note that for the term in Equation 17 

$$\int_{\Omega_w} \|x - a\|_2^4 \, dx \le d^2 w^{1+4/d}. \quad (19)$$
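Now note that, by Cauchy-Schwarz and by using Equations 18 and 19, we have for the term 
in Equation 16 

$$
\begin{aligned}
\int_{\Omega_w} |\langle \nabla f(a), x - a \rangle| \, \|x - a\|_2^2 \, dx &\le \sqrt{\int_{\Omega_w} \langle \nabla f(a), x - a \rangle^2 dx} \, \sqrt{\int_{\Omega_w} \|x - a\|_2^4 \, dx} \\
&\le \|\nabla f(a)\|_2 \, w^{1/2 + 1/d} \sqrt{d^2 w^{1+4/d}} \\
&\le d \, \|\nabla f(a)\|_2 \, w^{1+3/d}. \quad (20)
\end{aligned}
$$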



We thus have by combining Equations [15] [T6| [17 18 20 and 19 

|2 



0„ ^ 

This leads to, using Step 0 of the proof in Appendix B, 



^2^2 < ^MM^2+2/d^2Md||V/(a)||2w'+^/'* + 4Af2dW+4/'^ 

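$\le w^{2+2/d} \Big( \frac{\|\nabla f(a)\|_2}{2\sqrt{3}} + 2Md \, w^{1/d} \Big)^2.$ (21) 

In the same way, one can prove 

$w^2 \sigma^2 \ge w^{2+2/d} \Big( \frac{\|\nabla f(a)\|_2}{2\sqrt{3}} - 2Md \, w^{1/d} \Big)^2.$ (22) 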
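Bounds of the form $w^2 \sigma^2 \lessgtr w^{2+2/d} \big( \frac{\|\nabla f(a)\|_2}{2\sqrt{3}} \pm 2Md \, w^{1/d} \big)^2$ can be sanity-checked in dimension $d = 1$. The sketch below is an illustration under assumptions: it takes $f = \sin$ (so that $2M = 1$ bounds the second derivative), approximates the standard deviation on a box by a midpoint grid, and clips the lower bound at 0 when the bracket is negative, which is a choice made here and not stated in the text.

```python
import math


def box_std(a: float, h: float, m: int = 2000) -> float:
    """Standard deviation of sin(U), U uniform on [a - h/2, a + h/2],
    approximated on a midpoint grid with m points."""
    values = [math.sin(a + ((j + 0.5) / m - 0.5) * h) for j in range(m)]
    mean = sum(values) / m
    return math.sqrt(sum((v - mean) ** 2 for v in values) / m)


M = 0.5  # |f''| = |sin| <= 1 = 2M
d = 1

rows = []
for a, h in [(0.3, 0.1), (1.0, 0.05), (2.0, 0.2)]:
    w = h  # measure of the box in dimension 1
    lead = abs(math.cos(a)) / (2.0 * math.sqrt(3.0))  # ||grad f(a)||_2 / (2 sqrt(3))
    slack = 2.0 * M * d * w ** (1.0 / d)
    sigma = box_std(a, h)
    upper = w ** (1.0 / d) * (lead + slack)
    lower = w ** (1.0 / d) * max(lead - slack, 0.0)  # clipped at 0 (assumption)
    rows.append((lower, sigma, upper))
print(rows)
```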

Step 2: Majoration on the strata. Lemma 3 tells us that with probability $1 - \delta$ (i.e. on the event 
$\xi$), each stratum $\Omega_k$ is partitioned in $S_k \ge \max[\lambda_{K,k} N, S]$ hyper-cubic sub-strata $\Omega_{k,i}$ of same 
measure, and that there is at least one sample per sub-stratum. The measure of those sub-strata is 
thus $w_{k,i} = \frac{w_k}{S_k}$. 

We have for stratum ^ by using Equation [2T[ 

22^ 2+2/d J|V/(afe j)||2 i/d\2 

Wk.^<^k,^ < Wk^, { ^ +2Mdw^'^^ ) , 

where Ofc j is the center of stratum ilk,i- 

Let Ck,i be a point in flk.i such that ^ — arg miricgo^ ^ 1 1 V/(c) 1 12. By using that and Equation 
we get that the variance on strata k that is bounded by 

E^'' 2 2 ^ 2+2/d/ ||V/(afe i)||2 i/<iN2 

i=i i=i 

< ^2+2/.( ||V/(c^OI|2 ^ 3Md^V/)^ 
i=i 2v3 

<_^fc V^.,^JIW(cm)I|2^,,,,^,„1/.,2 



i— 1 ^ 



Let us call $g(x) = \frac{\|\nabla f(x)\|_2}{2\sqrt{3}} + 3Md \, w_k^{1/d}$. As $w_k \ge w_{k,i}$, and $\|\nabla f\|_2$ is positive, we have 

E<.-L<^E«^.fff(c..)^. (23) 



2—1 Z— 1 
Step 3: Minoration of the number of sub-strata in each stratum. By raising Equation 21 to the 
power $\frac{d}{2(d+1)}$, we get for stratum $\Omega_k$ that 

Let c™ be a point in ilk such that c™ = argmincgOfc ||V/(c)||2. Note that this implies that 
EtiM^-^^fiS^ + ^Mdwl'")^- < + 3Mdwl/')^du. By using that 

and Equation 12 we get that T,k — X]fe(wfcffc) ^ is bounded as 



d+1 

k J 



k=l 



2^3 



'[0,1]'' 



< / g{u)^du. (24) 

'[0,1]'' 



In the same way, we can deduce 

'[o,i]'i 2%/3 



> / ( " - Uldwl!") "^'du. (25) 



Let c^^ be a point in ^k such that c^^ = argmaxcgns, ||V/(c)||2. For a stratum k, by using 
Equations |22] and [12] 

2v 3 

As for any m > and a > one has (1 — u)^" > 1 + ctu, the last Equation leads to 

1 1 

< 



K^.) ^ (^^f^ + ^Mdwl/" - MId{wl" + wl'')) ^ 

1 



< 



1 

< 



1 , 1 9Mdw}/'^ . 



d+2 \ d + 2 I 2d+3 

' (g(cf )) ''+^ (5(cf )) 



As Wfe i = ^ this leads with the last Equation and Equation 



24 



V iV y 9(0*0"+^ (.9(cf))''+i 



20 



Step 4: Bound on the pseudo-risk As c^'^ — maxcgo^ l|V/(c)||2 and Ck,i = 
mincgnfc,. ||V/(c)||2, and as g{x) = "^^^^^H" + SMdwl^'^, we have for any (a, &) > that 



gf^^M^b < mincgnfc.i ff(c)" ■ By using that and Equations 



23 



and 26 



1=1 



I j c V d+2 + , 2d+3 )9[Ck,i) 

y N ^ Skfrt (g(cf))^ (5(cf))^ 

< ^ — > ( mm (7(c) 'i+i + mm " * 



< 



'^k ~( cGOfc.i ceSlfc^i ((7(c)) "^+1 

Note also that by definition, $g(x) \ge 3Md \, w_k^{1/d}$. From that and the previous Equation, we deduce 

<( ^'°'^' ) " w,{- / g{u)^du + 9Mdwr)- 

Finally, by summing over all strata and because all strata have same measure Wk ~ 



/ m lid 9 " ^ ^ f d 

EE<^<^^( V ) E(/ 5N^rf- + -.x9AMu;, 

1=1 i=l A:=l "^^k 

/ frn lid (g(M))^dUN ^ /■ d 1 1 

<( ^'°'^' ' ) ' {l^^j{u)^^du + 9Md{^)-^) 



d + 1 



I / f d 2(d+i) r , d+2 1 1 \ 

(27) 



Step 5: Bound on Jj^ (7(u) du Note that because < 1, we have 



2v 3 



We thus have 



g{u)^^du< f ( Il^/(")ll2 -)d+i^^^3j^,j^^d+i_ (28) 
[0,1]'' "'[0,1]'' 2V3 



Note also that for a; > 0, and as ^^'^j'"'"-' < 4, we have 



21 



Letuscalll] = /jg ^,, ( 
get 



li^Mjillh'^ d+i Then by applying the previous result to Equation 



28 



we 



2(d+l) 



^J[o,i]<i ^ ^"'[0,1]'' 2V3 



2{d + l) , 

= 1 



2\/3 
3Md 



2(d + l) 



16E^2^ 1 



-J 



(29) 



Note also that by Equation 12 we know that $\|\nabla f(u)\|_2 \le \|\nabla f(0)\|_2 + M \sqrt{d}$. From that we deduce 
that 



[0,1]'^ 



g{u)^du<j: + ?,Mdw^+' 
< E + 3Af d. 



(30) 



Step 6: Final bound on the pseudo-risk From Equations 27 29 and 30 we deduce 

K 5, 



TTT ~T iVT" ^ "'10,11'' "'fo.il'' ^ 



N d 



E — 



16E^i^ 1 



[0,1]'' 

1 

d+l 



< 



< 



d+2 1 1 

-%Md{T, + 2,Md) " ij^)"*' 
1 



jy d 



N d 



2(d+i) 2(d+i) / 3Md\ 4,1, 1 

E^+25Md(E + l)^(l + ^j (-^)''+^ 



2(d+l) . 1 , 1 

E^+C7(-)-^ 



where C = 25Md(E + 1)^ (l + ^) 



Note that iV = n- {2 + 2^ + d)K ''+^ = n - BK •'+^ , where B = 2 + 2^ +d. From 
plugging that in the last Equation, we get 



1 



<- 



n - BK'i+^n'i+^ 
1 



l d M 



< 



< 



I 1 - BK-i+^n t^+i 

1 



2(d+l) , 1 , J_ 

E^+C(-)-^ 



2(d+l) , 1 , 1 

E^+C(-)-^ 



d+2 

n d 



,d + 2, _j_ 1 
l + ( ^ )BKd+in "i+i 



2(d+l) , 1 . 1 



1 



d+2 

n d 



2(d+l) 2(d+l) 1 1 / 1 X 1 1 

E^3— + 3E^^Bi\:3+Tn"d+T + c^^) + 3SCn-3n 



where we use for passing from the second to the third line of the Equation that (l — u) " < 1 + au. 



By it's definition, C > E d and this leads to 



K Sk 



k.i — d+2 

n d 



2(d+l) 1 1 / 1 X 1 

E^^ + GBCK^n^^ + C(— ) 



(31) 
Note first that by Equation 25 and because $\|\nabla f\|_2 \le L$ we have 

■\\yf(.u)\\ 



J[o,i]'' 2v3 



From that we deduce that 

ML+l)Vd^log{K/S) 



S < 2 + 2- 



S - 3LMdw^+' 



< 10(L + l)Vd,/log{K/S)il + ^). 



i+l^J d+1 



By plugging in Equation 3 1 the definition of C and the bound on B computed above, we obtain 

E E <^ + 650M(L + + f^)'v/log(ifMif ^ 

1=1 1=1 

+ 25Md(E + 1)^(1 + ^)^1)^; 
This concludes the proof. 






